Week 1: Review!
PS 818 - Statistical Models
Anton Strezhnev
University of Wisconsin-Madison
September 3, 2025
Welcome!
\[
\require{cancel}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\]
Course objectives
- Give you the tools you need to understand descriptive inference via statistical models and comment on other researchers’ work.
- Equip you with an understanding of the fundamentals of likelihood and Bayesian inference to enable you to learn new models that build on these principles.
- Connect these principles to the particular research questions that you want to answer.
- Teach you how to program and implement estimators by yourself!
Course workflow
- Lectures
- Topics organized by week
- Lectures are the “course notes” – readings are the reference manuals.
- Readings
- Mix of textbooks and papers
- All readings available digitally on Canvas
Course workflow
- Problem sets (25% of your grade)
- Meant as a check on your understanding of the material and a way of communicating with me about the course
- Collaboration is strongly encouraged – you should ask and answer questions on our Ed discussion board.
- Graded holistically on a plus/check/minus system.
Course workflow
- Midterm and Final exam (25% and 40% of your grade)
- The midterm exam will be structured like the problem sets with two main differences:
- You have about one week to complete it instead of two
- You may not collaborate with one another
- The final exam will be a written in-person exam
- Slightly more theory heavy, but some questions will require you to analyze code + output.
- Participation (10% of your grade)
- It is important that you actively engage with lecture and section – ask and answer questions.
- Do the reading!
- Participating on the discussion board counts towards this as well.
Assignment Timeline
- Problem Set 1: Assigned September 9, Due September 22
- Problem Set 2: Assigned September 29, Due October 13
- Midterm Exam: Assigned October 14, Due October 20 (1 week)
- Problem Set 3: Assigned October 28, Due November 10
- Problem Set 4: Assigned November 11, Due December 1 (Extra time due to Thanksgiving)
- Final Exam: TBA (whenever/wherever I can book a room)
Class Requirements
Overall: An interest in learning and willingness to ask questions.
Assume a background in intro probability and statistics (1st year sequence)
- You should be comfortable thinking about basic estimands/estimators + their properties
- You should be able to interpret a confidence interval for (e.g.) a difference-in-means.
Some prior knowledge of causal inference helpful but not critical
- We’ll be connecting predictive models to causal estimands
- Ideally should be familiar with the potential outcomes framework
You should also be familiar with linear regression
- \(\hat{\beta} = (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}Y\) should be a familiar expression
- You should know under what conditions it’s unbiased for \(\mathbb{E}[Y|X]\), and under what conditions it’s efficient.
If you want some review, check out chapters 1-6 of “Regression and Other Stories”
A brief overview
Week 2-4: Introduction to likelihood inference and GLMs
- Concept of the likelihood, MLE as an estimator + asymptotic properties
- Binary outcome models, count models, duration models
Week 5-7: Bayesian Inference and Multilevel Models
- Principles of Bayesian inference – posteriors, priors, data
- Quantities of interest: posterior means, credible intervals
- Estimation via MCMC
- Application to multilevel regression models
Week 8: Survey data
- Applying multilevel regression methods to survey data
- Survey weighting to address non-random sampling.
Week 9: Mixture Models and the EM algorithm
Week 10: Item response theory and ideal point models
Week 11-13: Flexible regression (ridge/lasso, forests, kernels)
Week 14: Semi-parametric theory
Week 15: Big regressions!
Random variables
- Understanding the behavior and properties of random variables is at the core of statistical theory.
- (Simply put) a random variable \(X\) is a mapping from a sample space to the real number line
- Random variables have a distribution (which we may or may not assume we know) defined by the cumulative distribution function (CDF)
\[F(x) = Pr(X \le x)\]
Random variables
Discrete random variables take on a countable number of values (e.g. Bernoulli r.v. can take on 0 or 1) and have a probability mass function (PMF)
\[p(x) = Pr(X = x)\]
Continuous random variables take on an uncountable number of values (e.g. the Normal distribution on \((-\infty, \infty)\)).
- No PMF, but have a density function (PDF) that integrates to a probability
\[Pr(X \in \mathcal{A}) = \int_{\mathcal{A}} f(x)dx\]
Remember: PMFs (and PDFs) sum (integrate) to \(1\) over the support of the random variable.
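As a quick numerical sanity check on these facts (a pure-Python sketch; the standard Normal density and the midpoint integration rule are illustrative choices, not course code):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) random variable."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, n=100_000):
    """Simple midpoint-rule numerical integration of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# The density should integrate to (approximately) 1 over its support,
# and Pr(X <= 0) for a standard normal should be 0.5.
total = integrate(normal_pdf, -10, 10)
half = integrate(normal_pdf, -10, 0)
print(round(total, 4), round(half, 4))
```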
Expectations
One important property of a random variable is its expectation \(\mathbb{E}[X]\). We’ll often make assumptions about the expectation of an R.V. while remaining agnostic about its true distribution.
- The expectation is a weighted average. For a discrete r.v. \(X\), we sum over the support of the random variable \(\mathcal{X}\).
\[\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x Pr(X = x)\]
For continuous r.v. we have an integral
\[\mathbb{E}[X] = \int_{\mathcal{X}} x f(x)\,dx\]
Fun fact: we can get the expectation of any function \(g(X)\) just by plugging it into the integral (the “law of the unconscious statistician”)
\[\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x) f(x)\,dx\]
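The discrete analogue of this plug-in rule is easy to verify directly. A small sketch using a fair six-sided die (the die is just an illustrative example; exact fractions avoid floating-point noise):

```python
from fractions import Fraction

# PMF of a fair six-sided die: Pr(X = x) = 1/6 for x in 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] = sum over the support of g(x) * Pr(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

e_x = expectation(lambda x: x, pmf)      # E[X] = 7/2
e_x2 = expectation(lambda x: x**2, pmf)  # E[X^2] = 91/6
print(e_x, e_x2)
```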
Expectations
You’ll need to know some essential properties of expectations to simplify certain problems
Most important. Linearity. For any two random variables \(X\) and \(Y\) and constants \(a\) and \(b\)
\[\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\]
Note that for a generic function \(g()\), \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\) in general. If \(g()\) is convex, Jensen’s inequality gives \(\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])\)
For a binary r.v. \(X \in \{0, 1\}\), it’s helpful to remember the “fundamental bridge” between expectations and probability
\[\mathbb{E}[X] = Pr(X = 1)\]
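Both the fundamental bridge and Jensen’s inequality can be checked exactly on a binary random variable (the value \(p = 3/10\) below is an arbitrary choice):

```python
from fractions import Fraction

# Binary r.v. X in {0, 1} with Pr(X = 1) = p.
p = Fraction(3, 10)
pmf = {0: 1 - p, 1: p}

e_x = sum(x * q for x, q in pmf.items())      # fundamental bridge: E[X] = Pr(X = 1)
e_x2 = sum(x**2 * q for x, q in pmf.items())  # also equals p, since 0^2 = 0 and 1^2 = 1

# Jensen's inequality for the convex function g(x) = x^2:
print(e_x == p, e_x2 >= e_x**2)
```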
Variance
We also care about the spread of a random variable – how far is a typical draw of \(X\) from its mean \(\mathbb{E}[X]\)? One measure of this is the variance.
\[Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\]
Also written as
\[Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\]
- Note that the square is a convex function, which means that by Jensen’s inequality \(\mathbb{E}[X^2] \ge \mathbb{E}[X]^2\). Variances cannot be negative!
- We also can define a covariance between two variables (does \(X\) take high values when \(Y\) takes high values?)
\[Cov(X, Y) = E\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]
Variance
- Variances also have some useful properties.
- For a constant \(a\)
\[Var(aX) = a^2Var(X)\]
- For any two random variables \(X\) and \(Y\)
\[Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\]
\[Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y)\]
- For independent random variables \(X\) and \(Y\)
\[Var(X + Y) = Var(X) + Var(Y)\]
\[Var(X - Y) = Var(X) + Var(Y)\]
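These identities can be verified exactly on a small joint distribution. A sketch (the joint PMF below is an arbitrary example with positive dependence between \(X\) and \(Y\)):

```python
from fractions import Fraction

F = Fraction
# A small joint PMF over (X, Y); probabilities sum to 1.
joint = {(0, 0): F(4, 10), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(4, 10)}

def E(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

var_x = E(lambda x, y: x**2) - E(lambda x, y: x)**2
var_y = E(lambda x, y: y**2) - E(lambda x, y: y)**2
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
var_sum = E(lambda x, y: (x + y)**2) - E(lambda x, y: x + y)**2

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y), exactly.
print(var_sum == var_x + var_y + 2 * cov)
```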
Conditional probabilities
- We will also spend a lot of time with conditional distributions and conditional expectations of random variables.
- What’s the probability that an individual enrolls in a job training program given their income?
- We represent the conditioning set using a vertical bar with the right-hand side denoting what is being conditioned on.
- For example: \(Pr(D_i = 1 | X_i = x)\)
Conditional probabilities
Key concept - Dependence and independence. If two variables are independent, the distribution of one does not change conditional on the other. We’ll write this using the \(\perp \!\!\! \perp\) notation.
- \(Y_i \perp \!\!\! \perp D_i\) implies
\[f(Y_i | D_i = 1) = f(Y_i| D_i = 0) = f(Y_i)\]
Two variables can be conditionally independent in that they are independent only when conditioning on a third variable. For example, we can have \(Y_i \cancel{\perp \!\!\! \perp} D_i\) but \(Y_i \perp \!\!\! \perp D_i | X_i\). This implies
\[f(Y_i| D_i = 1, X_i = x) = f(Y_i| D_i = 0, X_i = x) = f(Y_i | X_i =x)\]
Remember: Conditional independence does not imply independence or vice-versa!
Conditional expectations
A central object of interest in statistics is the conditional expectation function (CEF) \(\mathbb{E}[Y | X]\).
- Given a particular value of \(X\), what is the expectation of \(Y\)?
- The CEF is a function of \(X\).
All the usual properties of expectations apply to conditional expectations.
- We also will often make use of the law of total expectation
\[\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]]\]
Easiest to think about this in terms of discrete r.v.s
\[\mathbb{E}[Y] = \sum_{x \in \mathcal{X}} \mathbb{E}[Y | X = x] Pr(X = x)\]
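This discrete version of the law of total expectation is a one-line computation. A sketch (the marginal of \(X\) and the conditional means of \(Y\) below are made-up numbers for illustration):

```python
from fractions import Fraction

F = Fraction
pr_x = {0: F(2, 5), 1: F(3, 5)}  # Pr(X = x)
cef = {0: F(1, 2), 1: F(2, 1)}   # E[Y | X = x]

# Law of total expectation: average the CEF over the distribution of X.
e_y = sum(cef[x] * pr_x[x] for x in pr_x)
print(e_y)  # (1/2)(2/5) + (2)(3/5) = 7/5
```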
Estimation
- One critical use of statistical theory is understanding how to learn about things we don’t observe using things that we do observe. We call this estimation.
- e.g. What is the share of voters in Wisconsin who will turn out in the 2026 election?
- What is the share of voters who turn out among those assigned to receive a GOTV phone call?
- Estimand: The unobserved quantity that we want to learn about. Often denoted via a greek letter (e.g. \(\mu\), \(\pi\))
- Often a “population” characteristic that we want to learn about via a sample.
- Although recall causal estimands can’t be fully observed even in a finite sample!
- Important to define your estimand well. (Lundberg, Johnson and Stewart, 2022)
Estimation
- Estimator: The function of random variables that we will use to try to estimate the quantity of interest. Often denoted with a hat on the parameter of interest (e.g. \(\hat{\mu}\), \(\hat{\pi}\))
- Why are the variables random?
- Classic inference: We have a random sample from the population – if we took another sample, we would obtain a different realization of our estimator.
- Randomization inference: We have a randomly assigned treatment – if we were to re-run the experiment, we would observe a different treatment/control allocation.
- Estimate: A single realization of our estimator (e.g. 0.3, 9.535)
- We often report both point estimates (“best guess”) and interval estimates (e.g. confidence intervals).
- Careful not to confuse properties of estimators with properties of the estimates themselves.
Estimation
- The classic estimation problem in statistics is to estimate some unknown population mean \(\mu\) from an i.i.d. sample of \(n\) observations \(Y_1, Y_2, \dotsc, Y_n\).
- We assume that each \(Y_i\) is a draw from the target population with mean \(\mu\). (identically distributed) – therefore \(\mathbb{E}[Y_i] = \mu\)
- We’ll also assume that knowing \(Y_i\) tells us nothing about any other \(Y_j\): \(Y_i \perp \!\!\! \perp Y_j\) (independently distributed) – this implies \(Cov(Y_i, Y_j) = 0\)
- Our estimand: \(\mu\)
- Our estimator: The sample mean \(\hat{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)
- Our estimate: A particular realization of that estimator based on our observed sample (e.g. \(0.4\))
Estimation
- Note that our estimator is a random variable – it’s a function of \(Y_i\)s which are random variables.
- Therefore it has an expectation \(\mathbb{E}[\hat{\mu}]\) (assuming \(Y_i\) has an expectation)
- It has a variance \(Var(\hat{\mu})\) (again, under regularity conditions)
- It has a distribution (which we may or may not know).
Unbiasedness
Is the expectation of \(\hat{\mu}\) equal to \(\mu\)?
\[\mathbb{E}[\hat{\mu}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n}E\left[\sum_{i=1}^n Y_i\right]\]
Next we use linearity of expectations
\[\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right]\]
Finally, under our i.i.d. assumption
\[\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n \mu}{n} = \mu\]
Therefore, the bias, \(\text{Bias}(\hat{\mu}) = \mathbb{E}[\hat{\mu}] - \mu = 0\)
Variance
What is the variance of \(\hat{\mu}\)? Again, start by pulling out the constant.
\[Var(\hat{\mu}) = Var\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right]\]
We can further simplify by using our i.i.d. assumption. The variance of a sum of i.i.d. random variables is the sum of the variances.
\[\frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right]\]
“identically distributed”
\[\frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]
Therefore, the variance is \(\frac{\sigma^2}{n}\)
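Both results – \(\mathbb{E}[\hat{\mu}] = \mu\) and \(Var(\hat{\mu}) = \sigma^2/n\) – are easy to see in a simulation. A pure-Python sketch (the Normal data, seed, \(\mu = 2\), \(\sigma = 3\), and replication counts are all arbitrary choices):

```python
import random
import statistics

random.seed(818)
mu, sigma, n, reps = 2.0, 3.0, 100, 5_000

# Draw many independent samples of size n and compute the sample mean of each:
# the estimates should average to mu and have variance close to sigma^2 / n.
estimates = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

print(round(statistics.fmean(estimates), 2))     # close to mu = 2
print(round(statistics.variance(estimates), 3))  # close to sigma^2 / n = 0.09
```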
Asymptotic behavior
As \(n\) gets large, what can we say about the estimator \(\hat{\mu}\)?
First, we can show that it is consistent – it converges in probability to the true parameter \(\mu\)
- Unbiasedness + a variance that goes to \(0\) as \(n\) gets large is sufficient.
- Some estimators may be biased but have bias terms that go to \(0\) – if the variance also goes to \(0\), these are still consistent.
Second, we can say something about the distribution of \(\hat{\mu}\).
- Remember, we’ve only made assumptions about \(\mathbb{E}[Y_i]\) and \(Var(Y_i)\) (that they exist). We have made no assumptions on the distribution of \(Y_i\). \(Y_i\) can be Normal, Poisson, Bernoulli, or whatever!
- However, we know something about sums and means of random variables – they are well-approximated by a normal distribution. The Central Limit Theorem!
- So in large samples, the sampling distribution of \(\hat{\mu}\) is close to normal. This lets us construct confidence intervals and do inference with this approximation and be confident that we won’t be far off!
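A simulation makes the point concrete: even with a heavily skewed outcome, normal-approximation confidence intervals cover the truth about 95% of the time. A sketch (the Exponential(1) outcome, seed, \(n = 200\), and replication count are illustrative choices):

```python
import random
import statistics

random.seed(42)
n, reps = 200, 4_000
mu = 1.0  # mean of an Exponential(1) r.v. -- heavily skewed, far from normal

covered = 0
for _ in range(reps):
    y = [random.expovariate(1.0) for _ in range(n)]
    ybar = statistics.fmean(y)
    se = statistics.stdev(y) / n**0.5
    # Normal-approximation 95% confidence interval:
    if ybar - 1.96 * se <= mu <= ybar + 1.96 * se:
        covered += 1

print(covered / reps)  # close to 0.95 despite the skewed outcome distribution
```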
Regression review
- Rather than just estimating a population mean, we are more typically interested in some population conditional expectation \(\mathbb{E}[Y|X]\)
- \(Y_i\): Outcome/response/dependent variable
- \(X_i\): Vector of regressor/independent variables
- “How does the expected value of \(Y\) differ across different values of \(X\)?”
- Suppose we observe \(N\) paired observations of \(\{Y_i, X_i\}\).
- How do we construct a “good” estimator of \(\mathbb{E}[Y|X]\)?
- What assumptions do we have to make to get…consistency…unbiasedness…efficiency?
Regression review
Consider the ordinary least squares estimator \(\hat{\beta}\) which solves the minimization problem:
\[\hat{\beta} = \argmin_b \ \sum_{i=1}^N (Y_i - X_ib)^2\]
We can do some algebra and find a closed form solution for this optimization problem
\[\hat{\beta} = (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\]
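For an intercept-plus-slope model, \((\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}Y\) reduces to solving a \(2 \times 2\) system by hand. A pure-Python sketch on simulated data (the true coefficients \((1, 2)\), seed, and error distribution are illustrative choices):

```python
import random

random.seed(1)

# Simulate y_i = 1 + 2 x_i + eps_i and solve the normal equations
# (X'X) b = X'y directly, with X = [1, x].
n = 1_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# Entries of X'X and X'y:
sx = sum(x)
sxx = sum(xi * xi for xi in x)
sy = sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Invert the 2x2 matrix [[n, sx], [sx, sxx]] and multiply by (sy, sxy):
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

print(round(b0, 2), round(b1, 2))  # close to the true (1, 2)
```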
Regression review
Assumption 1: Linearity
\[Y = \mathbf{X}\beta + \epsilon\]
Assumption 2: Strict exogeneity of the errors
\[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]
These two imply:
\[\mathbb{E}[Y|\mathbf{X}] = \mathbf{X}\beta = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \dotsb + \beta_kX_{k}\]
Best case: Our CEF is truly linear (by luck or we have a saturated model)
Usual case: We’re at least consistent for the best linear approximation to the CEF
Regression review
Assumption 3: No perfect collinearity
- \(\mathbf{X}^{\prime}\mathbf{X}\) is invertible
- \(\mathbf{X}\) has full column rank
This assumption is needed for identifiability – otherwise no unique solution to the least squares minimization problem exists!
Fails when one column can be written as a linear combination of the others
- Or when there are more regressors than observations \(k > n\)
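A minimal numerical illustration of the failure: if one column is an exact multiple of another, the determinant of \(\mathbf{X}^{\prime}\mathbf{X}\) is zero and no inverse exists (the specific numbers below are arbitrary):

```python
# Two-column design with perfect collinearity: x2 = 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]

# Entries of X'X for X = [x1, x2]:
a = sum(v * v for v in x1)
b = sum(u * v for u, v in zip(x1, x2))
d = sum(v * v for v in x2)

det = a * d - b * b
print(det)  # 0.0 -> (X'X)^{-1} does not exist, no unique least-squares solution
```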
Regression review
- Under assumptions 1-3, our OLS estimator \(\hat{\beta}\) is unbiased and consistent for \(\beta\)
- Let’s do a quick proof for unbiasedness
\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\
&= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\
&= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)
\end{align*}\]
- Then we can obtain the conditional expectation of \(\mathbb{E}[\hat{\beta} | \mathbf{X}]\)
\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\
&= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\
&= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\
&= \beta
\end{align*}\]
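The unbiasedness result can be checked by simulation: holding the design fixed and re-drawing the errors many times, the estimates should average out to the truth. A pure-Python sketch for the slope coefficient (true slope \(2\), seed, and design are illustrative choices):

```python
import random
import statistics

random.seed(818)

def ols_slope(x, y):
    """Slope coefficient from the normal equations for y = b0 + b1*x."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Fixed design, true slope beta1 = 2; only the errors are re-drawn.
x = [i / 50 for i in range(50)]
slopes = []
for _ in range(5_000):
    y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols_slope(x, y))

print(round(statistics.fmean(slopes), 2))  # close to 2
```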
Regression review
Lastly, by law of total expectation
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]
Therefore
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]
Consistency requires us to show the convergence of \((\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\) to \(0\) in probability as \(N \to \infty\).
- This actually requires weaker assumptions: \(\mathbb{E}[\mathbf{X}^{\prime}\epsilon] = 0\) but not necessarily \(\mathbb{E}[\epsilon | \mathbf{X}] = 0\).
But what have we not assumed?
- Anything about the distribution of the errors!
Regression review
- Assumption 4 - Spherical errors
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0\\
0 & \sigma^2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix} = \sigma^2 \mathbf{I}\]
- Benefits
- Simple, unbiased estimator for the variance of \(\hat{\beta}\)
- Completes Gauss-Markov assumptions \(\leadsto\) OLS is BLUE (Best Linear Unbiased Estimator)
- Drawbacks
- Homoskedasticity is rarely plausible for real-world data
Regression review
Good news! We can relax homoskedasticity (but still keep no correlation) and do inference on the variance of \(\hat{\beta}\)
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix}
\sigma^2_1 & 0 & \cdots & 0\\
0 & \sigma^2_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2_n
\end{bmatrix}\]
“Robust” standard errors using the Eicker-Huber-White “sandwich” estimator
- Consistent but not unbiased for the true sampling variance of \(\hat{\beta}\)
\[\widehat{Var(\hat{\beta})} = (\mathbf{X}^{\prime}\mathbf{X})^{-1} \mathbf{X}^{\prime}\hat{\Sigma}\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\]
- \(\hat{\Sigma}\) is our estimate of the variance-covariance matrix using the squared residuals on the diagonals
- Extensions to “clustered” standard errors that allow arbitrary correlation within groups.
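In the single-regressor, no-intercept case the sandwich collapses to a scalar formula, which makes the mechanics transparent. A sketch (the data-generating process with error sd growing in \(|x|\), the seed, and \(n\) are illustrative choices; this is HC0, with no small-sample correction):

```python
import random

random.seed(7)

# Regress y on a single regressor (no intercept), with heteroskedastic
# errors whose standard deviation grows with |x|.
n = 5_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, abs(xi)) for xi in x]

sxx = sum(xi * xi for xi in x)
bhat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
resid = [yi - bhat * xi for xi, yi in zip(x, y)]

# HC0 sandwich: (X'X)^{-1} X' Sigma_hat X (X'X)^{-1}, with the squared
# residuals on the diagonal of Sigma_hat.
meat = sum(xi * xi * ei * ei for xi, ei in zip(x, resid))
var_robust = meat / sxx**2

# "Classical" homoskedastic variance estimate for comparison:
sigma2 = sum(e * e for e in resid) / (n - 1)
var_classic = sigma2 / sxx

print(var_robust > var_classic)  # classical SEs understate the variance here
```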
Regression review
- Assumption 5 - Normality of the errors
\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]
- Not necessary even for Gauss-Markov assumptions
- Not needed to do asymptotic inference on \(\hat{\beta}\)
- Why? Central Limit Theorem!
- Benefits? Exact finite-sample inference – the \(t\) and \(F\) distributions of our test statistics hold exactly rather than only asymptotically.
Regression review
- What do we need for OLS to be consistent for the “best linear approximation” to the CEF?
- What do we need for OLS to be consistent and unbiased for the conditional expectation function?
- Truly linear CEF
- But still no assumptions about the outcome distribution!
- What do we need to do inference on \(\hat{\beta}\)?
- We almost never assume homoskedasticity because “robust” SE estimators are ubiquitous
- Even some forms of error correlation are permitted (“cluster” robust SEs)
- Sample sizes are usually large enough where Central Limit Theorem implies a normal sampling distribution is a reasonable approximation.